{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction To Probabalistic Graph Models\n",
    "## Scott Hendrickson\n",
    "## 2016-Aug-19\n",
    "\n",
    "Requirements:\n",
    "    * numpy\n",
    "    * pandas\n",
    "    * libpgm\n",
    "    * networkx  (for plotting)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notes: This is a introductory survey. Unlike my favorite kind of RST, we won't build up models from first assumptions and small steps as much as try I will try to survey the basic concepts with examples, demonstrations and tie-ins to things with which you already are familiar."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why PGMs?\n",
    "When we started statistics, we talked about measurements, for example, of the height of students in you school, and noted that instances of these meansurements--events--fall into a probabiliy distribution. In this case, it was likely a Normal-looking distribution with many measurements around the average and a few measurements larger or smaller than average.\n",
    "\n",
    "This introduced the idea of a random variable.\n",
    "\n",
    "What do we when measurements of a random variable fall according to some complex probability distribution? This happens all the time. Sometimes we approximate. Other times, we realize that the random variable we are measuring has some comlex underlying dependencies, possibly on other distributions, and we can address these with a model.\n",
    "\n",
    "For example, imagine that we measure the heights of 7th graders in your school and call this our random variable. Later, we realize that the average height of males and females is different by a few inches. We need a model that accounts for the depencence of height on sex at this age.\n",
    "\n",
    "From our data we can might write:\n",
    "\n",
    "$$p(male) = .49$$\n",
    "$$p(female) = .51$$\n",
    "$$p(h) \\propto \\exp{\\frac{(h-\\bar{h})^2}{2 \\sigma_h^2}}$$\n",
    "\n",
    "And now, we know that $\\bar{h}$ and $\\sigma_h$ will have different values for males and females."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This kind of problem is common!  It turns out there are powerful general strategies for dealing with whole classes of proglems of dependencies in the distribution of random variables.\n",
    "\n",
    "## Probabalistic Graph Models (PGMs)\n",
    "\n",
    "In general, we want to model the joint probability distribution for a random variable in terms of other random variables and model parameters.\n",
    "\n",
    "$$X = P(x_1, x_2, \\dots ,x_n)$$ \n",
    "\n",
    "for $n$ random variables. \n",
    "\n",
    "(Let's assume they're are all observable for today.)\n",
    "\n",
    "Now, we can decompose this probability by applying the chain rule:\n",
    "\n",
    "$$P(x_1, x_2, \\dots ,x_n) = P(x_1| x_2, \\dots ,x_n) P(x_2, |x_3, \\dots ,x_n) \\dots P(x_n)$$\n",
    "\n",
    "If we unroll all of the terms, this is a pretty big mess.  But, for many problems there is another simplificaiton. In general, if $x_1$ is independent from $x_2$ then,\n",
    "\n",
    "$$P(x_1|x_2) = P(x_1)$$\n",
    "$$P(x_1, x_2) = P(x_1)P(x_2)$$\n",
    "\n",
    "Also, with Bayes rule, we can invert dependencies:\n",
    "\n",
    "$$P(y|x) = \\frac{P(y)P(x|y)}{P(x)}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Back to our height example...\n",
    "\n",
    "For height and gender, we have,\n",
    "\n",
    "$$P(h,s) = P(h|s)P(s)$$\n",
    "\n",
    "Observations:\n",
    "* We could have decomposed $P(g,h)$ and eneded up with $P(g|h)P(h)$\n",
    "* Bayes lets us swap \"sex is dependent on height\" case to the case where our model says \"height is dependent on sex\"\n",
    "* Direction is somewhat of a choice for simultaneous observations (determining causality is still hard in PGMs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In general, we can represent this decompostion by a graph\n",
    "\n",
    "To do this, map it like this:\n",
    "* _Nodes_: random variables. \n",
    "* Each node has a Conditional Probability Distribution (CDP).\n",
    "* _Edges_: dependencies \n",
    "\n",
    "There are more tractable graphs and lest tractable graphs for problem solving. An example of one important property is whether P factorizes G.  We say a JPD P _factorizes_ over graph G, if P can be encoded by:\n",
    "\n",
    "$$P(x_1, x_2, \\dots ,x_n) = \\Pi_{i=1}^n P(x_i|Par_G(x_i))$$\n",
    "\n",
    "Where $Par_G(x_i)$ is the parent graph of G."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import networkx as nx\n",
    "\n",
    "G=nx.DiGraph()\n",
    "G.add_edge('sex','height',weight=0.6)\n",
    "nx.draw_networkx(G, node_color='y',node_size=2000, width=3)\n",
    "plt.axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why is this formalism a useful probabalistic problem solving tool?\n",
    "\n",
    "This tool can model a much general set of Joint Probability Distributions than our simple student-height problem.\n",
    "\n",
    "Some advantages of managing statustical thinking with graphs include:\n",
    "\n",
    "We can use graphs for book-keeping about distributions and dependencies. Practical models can have many nodes. Graphs models help us manage complexity. \n",
    "\n",
    "We can talk generally about classes of graphs and sub-graphs that solve problems. E.g.\n",
    "* Bi-directional Graphs are Markov Models\n",
    "* Directed Acyclic Graphs (DAGs) are called Baysian Models (we will work here the rest of the day)\n",
    "\n",
    "We can state reasoning rules that are constent with the statistical assumptions that allow understand and transform nodes or groups of nodes. E.g. by using rules of graph manipulation to reduce, simplify or cut graphs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning and Resoning Tasks\n",
    "\n",
    "There a some jobs we need to figure out how to do in order to make this tool practically useful. Maybe in order of obviousness:\n",
    "\n",
    "0. Ask questions about the probabilities represented by the model. I.e. _Inference_:\n",
    "    * Causal Reasoning is looking for downstream effects of assumptions about parents. E.g. Given a male student, how likely is it he will be 165 cm in height?\n",
    "    * Evidential Reasoning is looking for upstream effects of assumptions about childern. E.g. Give a height of 174 cm, what is probability that the student is female? \n",
    "    * Intercausal. Given two observations of a common cause, what can we say about the likelihood of one or other of the \"causal\" measurements?\n",
    "\n",
    "1. Learn parameters of a model given graph structure and representative data (E.g. naive bayes, LSA topic modeling)\n",
    "\n",
    "2. Learn the structure of the graph given representative data\n",
    "\n",
    "3. Build heuristics and rules by which we can reason about graphs. E.g. How to convert Bayese models to Markov models?  What sort of structures allow influence to propogate and which do not? How to simplify complex sub-graphs? In general, learn how to reason about statistics by learning the rules of reasoning about PGMs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tasks and Examples\n",
    "\n",
    "Let's try to look at 1-3.  Also, I am going to switch 0 and 1 so we can have a graph model with parameters to use for inference.\n",
    "\n",
    "Practical problem solving often procees from guessing the graph. Graph discovery may require big, big data and be computationally challenging."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import csv\n",
    "import json\n",
    "\n",
    "from libpgm.graphskeleton import GraphSkeleton\n",
    "from libpgm.nodedata import NodeData\n",
    "from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork\n",
    "from libpgm.tablecpdfactorization import TableCPDFactorization\n",
    "from libpgm.pgmlearner import PGMLearner"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## And now, for some data\n",
    "\n",
    "Survival data from Titanic based on age, sex and travel class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titanic = pd.DataFrame.from_csv(\"./data/titanic3.csv\", index_col = None)\n",
    "titanic.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titanic.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A pivot table might give another useful summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ptable = pd.pivot_table(titanic, values=[\"name\"], columns=[\"survived\", \"pclass\",\"sex\"], aggfunc=lambda x: len(x.unique()), margins=True)\n",
    "print ptable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# housekeeping\n",
    "# libpgm needs data as node:value list for each row\n",
    "with open(\"./data/titanic3.csv\") as f:\n",
    "    rdr = csv.reader(f, )\n",
    "    headers = next(rdr, None)\n",
    "    data = [{k:float(v) for k,v in zip(headers, row) if k !=\"name\"} for row in rdr]\n",
    "headers.remove(\"name\")  # not going to model survival based on name\n",
    "#print data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learn Parameters of Graph Model give Data\n",
    "\n",
    "Guess a graph for the model. I guess this:\n",
    "* age determines class -- older people have more money and less patience\n",
    "* survival is determined by sex -- \"women and children first\"\n",
    "* survival is determined by class of travel -- people in steerage had farther to go to get out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pgn = {\n",
    "    \"V\": headers,\n",
    "    \"E\": [[\"age\", \"pclass\"],\n",
    "        [\"sex\", \"survived\"],\n",
    "         [\"pclass\", \"survived\"]],\n",
    "    \"Vdata\": None }\n",
    "# print pgn\n",
    "G=nx.DiGraph()\n",
    "for f,t in pgn[\"E\"]:\n",
    "    G.add_edge(f,t)\n",
    "nx.draw_networkx(G, node_color='y',node_size=2000, width=3)\n",
    "plt.axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some choices:\n",
    "* Baysian model (directed graph, probabilities \"propogate\")\n",
    "* Discrete distributions on the nodes (continuous is another world)\n",
    "* A common algorithm for fitting parameters is a maximum likelihood algorithm. There are others.\n",
    "\n",
    "While it is totally worth seeing how it is done, here we just do it to make sure we get to look at examples of most of our tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "skel = GraphSkeleton()\n",
    "skel.V = pgn[\"V\"]\n",
    "skel.E = pgn[\"E\"]\n",
    "skel.toporder()\n",
    "\n",
    "learner = PGMLearner()\n",
    "result = learner.discrete_mle_estimateparams(skel, data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the nodes have conditional probability information stored in them. For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.DataFrame(result.Vdata[\"sex\"][\"cprob\"]).transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.DataFrame(result.Vdata[\"age\"][\"cprob\"]).transpose()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's look at a downstream node."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.DataFrame(result.Vdata[\"pclass\"][\"cprob\"]).transpose()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Causal Reasoning\n",
    "\n",
    "Set some assumptions and see how this changes marginal probabilities associated with the other nodes in the garph.\n",
    "\n",
    "Note: Querying change the graph information--the graph has to be recalculated for the stated assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# use our solutions from above\n",
    "nd = NodeData()\n",
    "nd.Vdata = result.Vdata\n",
    "nd.alldata = None\n",
    "bn = DiscreteBayesianNetwork(skel, nd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(male=0)? {:.3%}\".format(\n",
    "        tcpd.specificquery(dict(sex=[1]), dict())\n",
    ")\n",
    "\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(female=1)? {:.3%}\".format(\n",
    "        tcpd.specificquery(dict(sex=[0]), dict())\n",
    ")\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(female=1,survived=1)? {:.3%}\".format(\n",
    "         tcpd.specificquery(dict(sex=[1]), dict(survived=1))\n",
    ")\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(male=0,survived=0)? {:.3%}\".format(\n",
    "         tcpd.specificquery(dict(sex=[0]), dict(survived=0))\n",
    ")\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(male=0,class=3,survived=0)? {:.3%}\".format(\n",
    "         tcpd.specificquery(dict(sex=[0],pclass=[3.0]), dict(survived=0))\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# maybe useful for comparison\n",
    "pd.pivot_table(titanic, values=[\"name\"], columns=[\"sex\", \"pclass\",\"survived\"], aggfunc=lambda x: len(x.unique()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learn Graph Structure\n",
    "\n",
    "One general strategy uses a score to determine structures\n",
    "1. chose a score (for example AIC or BIC).  \n",
    "2. grab a node, make an edge\n",
    "3. calculate a model based on data\n",
    "4. calculate score\n",
    "5. if AIC goes down, keep the edge\n",
    "6. go to 2 until you run out of edges to try\n",
    "\n",
    "This is a lot of computation. People have tried all kinds of heuristics, simplification and sophisticated calculation caching schemes.  Some cases can to 1000s of nodes in hours.\n",
    "\n",
    "\n",
    "Another scheme uses constraints optimizaiton. That's what we will do..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# instantiate my learner \n",
    "learner = PGMLearner()\n",
    "\n",
    "# estimate structure\n",
    "result = learner.lg_constraint_estimatestruct(data, indegree=1)\n",
    "\n",
    "# output\n",
    "print json.dumps(result.E, indent=2)\n",
    "print json.dumps(result.V, indent=2)\n",
    "\n",
    "G=nx.DiGraph()\n",
    "for f,t in result.E:\n",
    "    G.add_edge(f,t,weight=0.6)\n",
    "nx.draw_networkx(G, node_color='y',node_size=2000, width=3)\n",
    "plt.axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## different model, so learn new parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "skel = GraphSkeleton()\n",
    "skel.V = result.V\n",
    "skel.E = result.E\n",
    "skel.toporder()\n",
    "\n",
    "learner = PGMLearner()\n",
    "result = learner.discrete_mle_estimateparams(skel, data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Queries with New Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nd = NodeData()\n",
    "nd.Vdata = result.Vdata\n",
    "nd.alldata = None\n",
    "bn = DiscreteBayesianNetwork(skel, nd)\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(male=0)? {:.3%}\".format(\n",
    "        tcpd.specificquery(dict(sex=[1]), dict())\n",
    ")\n",
    "\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(female=1)? {:.3%}\".format(\n",
    "        tcpd.specificquery(dict(sex=[0]), dict())\n",
    ")\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(female=1,survived=1)? {:.3%}\".format(\n",
    "         tcpd.specificquery(dict(sex=[1]), dict(survived=1))\n",
    ")\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(male=0,survived=0)? {:.3%}\".format(\n",
    "         tcpd.specificquery(dict(sex=[0]), dict(survived=0))\n",
    ")\n",
    "\n",
    "# query alters tables\n",
    "tcpd = TableCPDFactorization(bn)\n",
    "print \"What is p(male=0,class=3,survived=0)? {:.3%}\".format(\n",
    "         tcpd.specificquery(dict(sex=[0],pclass=[3.0]), dict(survived=0))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graph types and reasoning about probabilities\n",
    "\n",
    "Lots to do here, but for another day. A short read here hints at how much I skipped: https://en.wikipedia.org/wiki/Graphical_model\n",
    "        \n",
    "Some other types:\n",
    "* A factor graph is an undirected bipartite graph connecting variables and factors. Each factor represents a function over the variables it is connected to. This is a helpful representation for understanding and implementing belief propagation.\n",
    "* A clique tree or junction tree is a tree of cliques, used in the junction tree algorithm.\n",
    "* A chain graph is a graph which may have both directed and undirected edges, but without any directed cycles (i.e. if we start at any vertex and move along the graph respecting the directions of any arrows, we cannot return to the vertex we started from if we have passed an arrow). Both directed acyclic graphs and undirected graphs are special cases of chain graphs, which can therefore provide a way of unifying and generalizing Bayesian and Markov networks.[2]\n",
    "* An ancestral graph is a further extension, having directed, bidirected and undirected edges.[3]\n",
    "* A conditional random field is a discriminative model specified over an undirected graph.\n",
    "* A restricted Boltzmann machine is a bipartite generative model specified over an undirected graph.\n",
    "\n",
    "But maybe I gave your a place to start?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
